Index:
Introduction
Methodology
Data
Data Preprocessing
Visual Data Exploration
Predictive Modelling
Distance Based Model - K Nearest Neighbours
Probability Based Model - Naive Bayes
Information Based Model - Decision Tree
Information Based Model - Random Forest
Final Comparison of Best Models
Summary and Further Steps Forward
References
Dating is hard, especially in the modern age where everyone is overloaded with information but time poor. Many companies and startups try to make this problem easier with matching services, and dating apps have grown into a large and lucrative industry. Optimising the user experience is therefore essential to stay ahead of the competition.
There are two main things people consider when using such an app.
Our aim with this project is to identify correct matches for participants. Since no one likes to fill out forms, especially about intimate information, we will try to identify a list of the 10-20 most important questions required for the machine learning models to be accurate. We would like to model with as few inputs from the participants as possible while still achieving good accuracy.
Therefore, our goal in this project is to find:
We will be testing four machine learning models on the numerical data, since we want to evaluate the models equally. We would lose too much information if we transformed the categorical data into numerical form.
The models we will evaluate come from three different categories:
The reason we are choosing these models is their training speed. Social patterns change constantly, so the data will need constant updating; hence speed is important.
Another reason for choosing these models is that they are base representatives of their specific model types. By evaluating the base models, we can gain an understanding of which type of model is best to use for prediction. Once the best model type is determined, more advanced models of that type can be applied to improve accuracy; however, that is beyond the scope of this project.
We will split the data into 'Train' and 'Test' sets. Since we have a big enough data set (8378 rows), we split the train and test data 70% and 30%. We will perform all hyperparameter tuning on the Train set to get the optimal parameters; only then will the model be used on the Test set. This ensures zero information leakage between train and test, and best evaluates model performance.
We will be using the F1 score as our model evaluation metric. The F1 score is the harmonic mean of precision and recall. We are using this score because our data is imbalanced, and optimising the F1 score works well under class imbalance.
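As a quick, self-contained illustration (toy counts, not the project data), the F1 score equals the harmonic mean of precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# toy predictions for an imbalanced problem (9 negatives, 3 positives)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2/4
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)

# f1 is the harmonic mean of precision and recall: 2 * p * r / (p + r)
print(p, r, f1)
```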
For each model:
Create a pipeline and a list of parameters, run a cross validation and find the best parameters that optimise the F1 score.
Conduct feature selection via the F-score method and the Random Forest Classifier feature selection method to determine which set of features is required for match prediction.
Use the classifiers from both feature selection methods to predict both the training and test sets, then look at the results to see if overfitting takes place.
Evaluate models with the confusion matrix and the receiver operating characteristic (ROC) curve.
Check which feature selection method is better by conducting a paired t-test on the cross validation results from the two selection methods. This will determine which set of features is better for prediction. If there is no significant difference, the smaller set of features will be used and recommended.
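A minimal sketch of this paired t-test step, using `scipy.stats.ttest_rel` on two hypothetical arrays of fold-wise F1 scores (the numbers are illustrative only, not results from this project):

```python
import numpy as np
from scipy import stats

# hypothetical fold-wise F1 scores from the SAME cross-validation splits
# (paired samples), purely illustrative numbers
f1_fscore_sel = np.array([0.42, 0.45, 0.40, 0.44, 0.43])
f1_rf_sel     = np.array([0.47, 0.49, 0.44, 0.48, 0.46])

# paired t-test: are the per-fold differences significantly non-zero?
t_stat, p_value = stats.ttest_rel(f1_fscore_sel, f1_rf_sel)

if p_value < 0.05:
    print("significant difference: recommend the better-scoring feature set")
else:
    print("no significant difference: recommend the smaller feature set")
```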
Rabbit hole
Decision tree models are known to work better on categorical data, and the data already comes with numerical features binned into categorical columns. We can therefore also test whether there is a significant improvement in using categorical data with decision tree based models.
Evaluation
Once all the models and their cross validation results are obtained, we will compare the cross validation results of all the models to determine which is the best model to use.
The best model will be recommended together with its set of features. That final set of features is then the best set to use with the final model to predict matches for speed dating.
The data, from OpenML, was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
# so that we can see all the columns
pd.set_option('display.max_columns', None)
# set seed for reproducibility of results
np.random.seed(88)
Missing values recorded as '?'
speeddating = pd.read_csv('speeddating.csv', na_values = '?')
Checking if the data is correctly imported
speeddating.shape
speeddating.head()
has_null contains information on whether the line items have null values.
wave contains values 1-21, which are the group numbers of the participants. Both are redundant features and need to be removed, as they are of no visual or predictive importance.
speeddating.drop(columns =['has_null', 'wave'], inplace = True)
Here is a list of all columns
print(list(speeddating.columns))
It appears that the data has already been preprocessed, with numerical columns binned into categorical columns. We will separate these into a new data frame which could be used for predictive modelling later; all other categorical columns will be dropped.
#filter for all columns starting with d_
filter_col = [col for col in speeddating if col.startswith('d_')]
df_d = speeddating[filter_col].drop(columns = 'd_age').copy()
df_d.head()
for col in df_d:
print(df_d[col].value_counts())
There do not appear to be any strange or unexpected values in the categorical columns.
col_num = speeddating.columns[speeddating.dtypes!='object']
df_num = speeddating[col_num].copy() # copy to avoid modifying a view of speeddating when imputing later
df_num.head()
Basic descriptive statistics for the numerical data
df_num.describe()
We will use the numerical data for predictive modelling since it offers more information. We will also use the categorical data to test the information-based algorithms, as they tend to perform better on categorical features. However, binning the numerical data into categories loses information, which could affect performance.
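To illustrate the information loss from binning (a toy sketch; the bin edges here are invented and not the dataset's actual binning scheme), `pd.cut` collapses distinct numeric values into a single category:

```python
import pandas as pd

ages = pd.Series([18, 21, 24, 27, 33, 38, 45])

# bin the numeric ages into coarse categories, as the d_ columns do
binned = pd.cut(ages, bins=[17, 25, 35, 50], labels=['18-25', '26-35', '36-50'])
print(binned.tolist())

# 18, 21 and 24 all collapse into '18-25': their exact values are lost
```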
print("Total Number of missing values in the numerical data is",df_num.isna().sum().sum())
df_num.isna().sum()
The numerical data has multiple missing values in different columns
print("Total Number of missing values in the categorical data is",df_d.isna().sum().sum())
df_d.isna().sum()
The categorical (binned) features are free of missing values
Missing values observed in the numerical data will be imputed with the mean.
for col in df_num:
df_num[col] = df_num[col].fillna(df_num[col].mean())
Recalculate d_age, since we noticed that previously, if age = na and age_o = 24, then d_age = 24. This caused some spuriously large values.
df_num['d_age']=abs(df_num['age']- df_num['age_o'])
df_num.describe()
Here we conduct a final check to make sure all missing values are handled
print("Total Number of missing values in the numerical data is",df_num.isna().sum().sum())
Now we have 2 dataframes
There are no unusual or missing values in any of the columns and everything has been handled
We will explore the different variables present in the data and try to find relationships and trends that would be helpful to our research question. This visual analysis is conducted on a small subset of the original data to reduce the size of the graph output; doing it on a much larger sample would crash the notebook.
datavis = df_num.sample(100, random_state = 88)
print("Target variable in the original data")
df_num['match'].value_counts(normalize=True).round(3)
print("Target variable in the sample data")
datavis['match'].value_counts(normalize=True)
It is verified that the sample data is representative of the original data
A histogram of the target variable
datavis['match'] = datavis['match'].replace({0:'No', 1:'Yes'})
import altair as alt
chart=alt.Chart(datavis, width = 200, title = 'Histogram of Match(Target Data)').mark_bar().encode(x = 'match',
y = 'count()')
text = chart.mark_text(dy=-5).encode(
alt.X('match', title='Match', sort=None),
alt.Y('count()', title='Count of records'),
text='count()'
)
chart+text
Close to 90% of dates result in no match, which shows how difficult it is to get the right match. This also indicates the imbalance in our data; we need to be aware of this and manage it appropriately in the modelling process.
We want to ask the question: "How much do people prefer attractiveness?"
chart=alt.Chart(datavis, width = 300, title = 'Histogram of Prefer attractive').mark_bar(
).encode(alt.X('pref_o_attractive', bin=True,title="Preference of attractiveness"),
alt.Y('count()'))
chart
Nearly four-fifths of the respondents reported a preference for attractiveness below 30%; apparently attractiveness is not one of the top qualities people look for.
We also want to ask: "How much do people prefer sincerity?"
alt.Chart(datavis, width = 300, title = 'Histogram of Preference of sincerity').mark_bar(
).encode(alt.X('pref_o_sincere', bin=alt.Bin(maxbins=12),title ="Preference of sincerity"),
alt.Y('count()'))
Nearly three-quarters of the respondents said their preference for a sincere partner is 15-30%
Being funny is a quality everyone seems to want. So next, we would like to see: "How much do people prefer their partner being funny?"
alt.Chart(datavis, width = 300, title = 'Histogram of Prefer Funny').mark_bar(
).encode(alt.X('pref_o_funny', bin=True),
alt.Y('count()'))
Well, it looks like more people like their partner to be funny, as can be noticed from the left skew of the data: more mass is at the top end.
To get an idea of each of the features, we produced a group of histograms for all features in the data
This can give us some understanding of how the data is distributed for each of the features.
chart = alt.hconcat()
for col in datavis.columns:
base = alt.Chart(datavis, width = 200).mark_bar(
).encode(alt.X(col),
alt.Y('count()'))
chart |= base
chart
Let's look at a box plot to see if getting a match depends on the age difference between partners
alt.Chart(datavis, title = 'Age Difference make a difference in Match', width = 200).mark_boxplot().encode(
x='match',
y='d_age:Q',
color = 'match'
)
Apparently the difference in age is slightly important: as can be seen in the above boxplot, the median age difference for a match is 2 versus 3 for a non-match.
A box plot to see if getting a match depends on correlation of partner interests
alt.Chart(datavis, title = 'Does interest correlation makes a difference in the match', width = 200).mark_boxplot().encode(
x='match',
y='interests_correlate:Q',
color = 'match'
)
This boxplot is quite interesting and suggests that an interest correlation of at least about 0.5 is important for getting a match.
A heatmap to understand the decision making process at the event night
heat = alt.Chart(datavis, width = 200, height = 200,
title = 'Decision matrix of the participants').mark_rect().encode(
x=alt.X('decision:O',title="Decision at event night ",axis=alt.Axis(labelAngle=360)),
y=alt.Y('decision_o:O',title ="Partners decision at event night" ),
color='count()'
)
text = heat.mark_text(baseline='middle').encode(text = 'count()', color = alt.value('orange'))
heat + text
Only when both decisions are yes do we get a match: a match comes only from (yes, yes), while (no, no), (yes, no) and (no, yes) all result in no match. This is the cause of the class imbalance. Interestingly, the occurrences where one party is interested and the other is not together sum to more than the cases where both parties are uninterested.
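The rule behind the heatmap can be sketched on toy rows (the column names follow the dataset; the 0/1 encoding of the decisions is assumed here):

```python
import pandas as pd

# toy rows covering the four possible decision combinations
toy = pd.DataFrame({'decision':   [1, 1, 0, 0],
                    'decision_o': [1, 0, 1, 0]})

# a match requires BOTH parties to say yes
toy['match'] = (toy['decision'] & toy['decision_o']).astype(int)
print(toy)  # only the (1, 1) row yields match == 1
```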
A bar chart to see difference in like between matches and non matches
We would expect a match to occur if one party rates the other highly on 'like'
alt.Chart(datavis,
width = 300,height=100
).mark_bar(
).encode(
y='match',
x='mean(like)')
The mean 'like' rating is nearly 7/10 for matches
Bar charts to see how happy different demographics expect to be meeting their dates in the speed dating event
sd_sub = speeddating.sample(100, random_state = 88)
ch1 = alt.Chart(sd_sub, title = 'Expected happiness meeting potential partners',
width = 200,height=100
).mark_bar(
).encode(
y='gender',
x=alt.X('mean(expected_happy_with_sd_people)' ),
)
ch2 = alt.Chart(sd_sub,
width = 300
).mark_bar(
).encode(
y='age:O',
x='mean(expected_happy_with_sd_people)',
)
alt.vconcat(ch1,ch2)
It is interesting to see how expected happiness varies across gender and age groups here.
A scatter plot to understand whether partners consider having the same race and religion to be important
alt.Chart(datavis,
width = 300
).mark_point(
).encode(
x='importance_same_religion',
y='importance_same_race',
color = 'match',
size = 'count()'
).facet(column = 'match')
Those who got a match did not consider having the same race and religion to be important; we can see higher counts at low importance for both of these combined.
alt.Chart(sd_sub,
width = 200,title="Expected number of matches for male/females and their decision"
).mark_bar(
).encode(
x=alt.X('gender:O', title="" ,axis=alt.Axis(labelAngle=360)),
y=alt.Y('mean(expected_num_matches)',title="Average expected number of matches"),
column=alt.Column('decision') )
This shows that men who decide yes on a date have a higher expected number of matches compared to women; in fact, women who decided yes on a date had a lower expected number of matches. This suggests women are choosier and do not like to keep many options open.
alt.Chart(sd_sub,
width = 200,title="Guessing the probability that the partner liked you and their decision for male/females"
).mark_bar(
).encode(
x=alt.X('gender', title="" ,axis=alt.Axis(labelAngle=360)),
y=alt.Y('mean(guess_prob_liked)',title="Average probability that partner liked you"),
color=alt.Color('gender:N'),column='decision_o' )
Apparently females guess correctly more often than men: when women said their partner probably did not like them, they were right. Men tended to assume they were liked when actually they were not.
Another interesting insight is that when men thought their partner did not like them, they were often wrong.
Do women have a better intuition? This is something to think about :)
From the previous exploration, we see that categorical data has already been transformed into numerical data. For example, the 'race' of each party has been transformed into 'same race', where 1 means the races match and 0 means they don't. Other categorical columns are binned transformations of the numerical columns.
To ensure consistency, we will evaluate models first on the numerical features, since we will be using models such as KNN and Naive Bayes, which require numerical features.
Preparing the numerical data for machine learning.
#dropping the variables not needed for modelling and the target variable to create descriptive features
droplist = ['age','age_o', 'decision','decision_o', 'match']
data = df_num.drop(columns = droplist)
#saving the target feature in a separate pandas series
target = df_num['match']
colnames = data.columns
df_num.head()
The columns age and age_o are removed since we already have the feature d_age, which holds the age difference between the participants and therefore the same information; keeping all three would be duplication.
decision and decision_o together create the match feature, which is our target.
Hence they must be removed to avoid information leakage.
Min-max scaling transforms the data so that all values lie between 0 and 1. This ensures equal weighting of the features.
from sklearn import preprocessing
data = preprocessing.MinMaxScaler().fit_transform(data)
data =pd.DataFrame(data)
data.columns = colnames
Checking that the data is transformed correctly
pd.DataFrame(data).describe()
We will split the data in a 70:30 ratio. Model fitting, cross validation and hyperparameter tuning will be conducted on the train set only; the test dataset is used only when evaluating the model. This ensures we do not let the model overfit to the test data.
from sklearn.model_selection import train_test_split
D_train, D_test, t_train, t_test = \
train_test_split(data, target, test_size = 0.3,
stratify=target, shuffle=True, random_state=88)
Checking the shape of the test and train sets.
for x in [D_train, D_test, t_train, t_test]:
print(x.shape)
We will be testing out four machine learning models on the numerical data as mentioned earlier namely:
We will optimise the F1 score metric as mentioned earlier
Our model fitting process is as shown below:
Evaluate each model using cross validation, optimising the model parameters, the number of features selected and the feature selection method, to obtain one best model for each algorithm
If the best model is one of the tree models, we will then use the categorical data to see if fitting the data on binned features can significantly improve the performance
Custom functions to:
#function to plot the confusion matrix and classification report : input target and predictions
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import metrics
def plot_confusion_matrix(Targets, Predictions):
#create the confusion matrix
cm = metrics.confusion_matrix(Targets, Predictions)
#visualise the confusion matrix
df_cm = pd.DataFrame(cm, ['No match','Match'], ['No match','Match'])
plt.figure(figsize = (5,4))
cmplot = sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt="d", cbar=False)
cmplot.set_title('Confusion Matrix')
cmplot.set_ylabel('Actual')
cmplot.set_xlabel('Prediction')
#show classification report also
classification_report = metrics.classification_report(Targets, Predictions)
print(classification_report)
return(cmplot, classification_report)
# Following function plots roc curve for the model. Input MODEL, TEST DATA and string (name of model with parameters)
import altair as alt
def plot_roc_curve(MODEL, TEST_DATA, string):
t_prob = MODEL.predict_proba(TEST_DATA)
fpr, tpr, _ = metrics.roc_curve(t_test, t_prob[:,1])
df = pd.DataFrame({'fpr': fpr, 'tpr': tpr})
main = "ROC Curve of " + string
base = alt.Chart(df,
title=main
).properties(width=300)
roc_curve = base.mark_line(point=True).encode(
alt.X('fpr', title='False Positive Rate (FPR)', sort=None),
alt.Y('tpr', title='True Positive Rate (TPR) (a.k.a Recall)'),)
roc_rule = base.mark_line(color='green').encode(
x='fpr',
y='fpr',
size=alt.value(2)
)
return((roc_curve + roc_rule))
# The following custom function formats the search results of a pipeline as a Pandas data frame and sorts by highest score
def get_search_results(gs):
def model_result(scores, params):
scores = {'mean_score': np.mean(scores),
'std_score': np.std(scores),
'min_score': np.min(scores),
'max_score': np.max(scores)}
return pd.Series({**params,**scores})
models = []
scores = []
for i in range(gs.n_splits_):
key = f"split{i}_test_score"
r = gs.cv_results_[key]
scores.append(r.reshape(-1,1))
all_scores = np.hstack(scores)
for p, s in zip(gs.cv_results_['params'], all_scores):
models.append((model_result(s, p)))
pipe_results = pd.concat(models, axis=1).T.sort_values(['mean_score'], ascending=False)
columns_first = ['mean_score', 'std_score', 'max_score', 'min_score']
columns = columns_first + [c for c in pipe_results.columns if c not in columns_first]
return pipe_results[columns]
#Following function Visualises the top k features selected
from sklearn import feature_selection as fs
def feature_selection_plot(DATA, TARGET, num_features = 20, feature_selection_method = ['f_score','mutual_information'][0]):
if feature_selection_method == 'f_score':
fs_method = fs.f_classif
color = 'orange'
elif feature_selection_method == 'mutual_information':
fs_method = fs.mutual_info_classif
color = 'blue'
#fitting the fscore to get the score
fs_fit= fs.SelectKBest(fs_method, k=num_features)
fs_fit.fit_transform(DATA, TARGET)
#get the indicies with the highest n number of features
fs_indices = np.argsort(np.nan_to_num(fs_fit.scores_))[::-1][0:num_features]
#get the column names of the highest scoring features
best_features = list(data.columns[fs_indices])
#get the scores of the highest features
scores = fs_fit.scores_[fs_indices]
df = pd.DataFrame({'features': best_features,
'importances': scores})
chart = alt.Chart(df,
width=300,
title=feature_selection_method + ' Feature Importances'
).mark_bar(opacity=0.75,
color=color).encode(
alt.Y('features', title='Feature', sort=None, axis=alt.AxisConfig(labelAngle=0)),
alt.X('importances', title='Importance')
)
return chart
from sklearn.base import BaseEstimator, TransformerMixin
#Random forest selector function from [FeatureRanking.com](https://www.featureranking.com/tutorials/machine-learning-tutorials/case-study-predicting-income-status/) line 22
# custom function for RFI feature selection inside a pipeline
# here we use n_estimators=100
class RFIFeatureSelector(BaseEstimator, TransformerMixin):
# class constructor
# make sure class attributes end with a "_"
# per scikit-learn convention to avoid errors
def __init__(self, n_features_=10):
self.n_features_ = n_features_
self.fs_indices_ = None
# override the fit function
def fit(self, X, y):
from sklearn.ensemble import RandomForestClassifier
from numpy import argsort
model_rfi = RandomForestClassifier(n_estimators=100, random_state=999)
model_rfi.fit(X, y)
self.fs_indices_ = argsort(model_rfi.feature_importances_)[::-1][0:self.n_features_]
return self
# override the transform function
def transform(self, X, y=None):
return X[:, self.fs_indices_]
#This function visualises the top selected features with the Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
def RF_features_selection_plot(DATA, TARGET,
RFMODEL = RandomForestClassifier(n_estimators=100, random_state=999),
n_features = 10):
RFMODEL.fit(DATA, TARGET)
#get the sorted index
from numpy import argsort
rfi_fs_indices_ = argsort(RFMODEL.feature_importances_)[::-1][0:n_features]
#create an array with the column names (use the DATA argument rather than the global D_train)
best_features = DATA.columns[rfi_fs_indices_]
#create an array with the feature scores
best_score = RFMODEL.feature_importances_[rfi_fs_indices_]
df = pd.DataFrame({'features': best_features,
'importances': best_score})
chart = alt.Chart(df,
width=300,
title='Random Forest selected top Feature Importances'
).mark_bar(opacity=0.75,
color='darkgreen').encode(
alt.Y('features', title='Feature', sort=None, axis=alt.AxisConfig(labelAngle=0)),
alt.X('importances', title='Importance')
)
return(chart)
Below we will define a pipeline that takes the following into account:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=2,
random_state=88)
# define a pipeline with two processes
# if you like, you can put MinMaxScaler() in the pipeline as well
pipe_KNN = Pipeline([('fselector', SelectKBest()),
('knn', KNeighborsClassifier())])
params_pipe_KNN = {'fselector__score_func': [f_classif],
'fselector__k': [10, 20, D_train.shape[1]],
'knn__n_neighbors': [1, 3, 5, 7, 11, 15, 21],
'knn__p': [1, 2]}
gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN,
param_grid=params_pipe_KNN,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs = -2)
gs_pipe_KNN.fit(D_train, t_train);
The following table shows the top 5 parameter results for the KNN model on the training data
search_results_knn =get_search_results(gs_pipe_KNN)
rem = search_results_knn.columns[1:4]
search_results_knn.drop(columns=rem).head()
search_results_knn['fselector__score_func'][0].__name__
# checking the performance
alt.Chart(search_results_knn.drop(columns="fselector__score_func"),
title='KNN Performance Comparison'
).mark_line(point=True).encode(
alt.X('knn__n_neighbors', title='Number of Neighbors'),
alt.Y('mean_score', title='F1 Score', scale=alt.Scale(zero=False)),
alt.Color('fselector__k:N', title='No. of features')
).facet(column = 'knn__p')
The above plot shows a clear elbow at 11 neighbours with Euclidean distance, after which performance drops significantly. It is also clearly seen that performance with all features is much worse than with 10 or 20 features.
We see from the results that 10 features selected using the F-score method gave the best results.
These features are:
feature_selection_plot(D_train, t_train, 10, 'f_score')
import joblib
joblib.dump(gs_pipe_KNN.best_estimator_, 'best_KNN.pkl', compress = 1)
gs_pipe_KNN.best_estimator_
# Loading the KNN best estimator from file
m1_knn = joblib.load('best_KNN.pkl')
knn_pred = m1_knn.predict(D_test)
plot_confusion_matrix(t_test, knn_pred);
Checking for overfitting by applying the model to the train data
knn_pred_train = m1_knn.predict(D_train)
plot_confusion_matrix(t_train, knn_pred_train);
Performance drops a bit on the test data, indicating perhaps some mild overfitting.
The ROC curve for this model looks reasonably good.
plot_roc_curve(m1_knn, D_test,"KNN with p=1 and 11 nearest neighbors using 10 features")
pipe_KNN2 = Pipeline(steps=[('rfi_fs', RFIFeatureSelector()),
('knn', KNeighborsClassifier())])
params_pipe_KNN2 = {'rfi_fs__n_features_': [10, 20, D_train.shape[1]],
'knn__n_neighbors': [1,3, 5, 9, 11, 15, 21],
'knn__p': [1, 2]}
gs_pipe_KNN2 = GridSearchCV(estimator=pipe_KNN2,
param_grid=params_pipe_KNN2,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs = -2)
D_train_np = np.array(D_train)
t_train_np = np.array(t_train)
gs_pipe_KNN2.fit(D_train_np, t_train_np);
PerformanceComparison = get_search_results(gs_pipe_KNN2)
rem=PerformanceComparison.columns[1:4]
PerformanceComparison.drop(columns=rem).head()
import altair as alt
alt.Chart(PerformanceComparison,
title='KNN Performance Comparison Features'
).mark_line(point=True).encode(
alt.X('knn__n_neighbors', title='Number of Neighbors'),
alt.Y('mean_score', title='F1 Score', scale=alt.Scale(zero=False)),
alt.Color('rfi_fs__n_features_:N', title='features')
).facet(column = 'knn__p')
Using the random forest importances, the top 20 features gave the best performance, although at first glance performance with the top 10 features was not much different. Again, performance using all features is poor, so using all of them is not recommended.
RF_features_selection_plot(D_train, t_train, n_features = 20)
import joblib
joblib.dump(gs_pipe_KNN2.best_estimator_, 'best_KNN_RFI.pkl', compress = 1)
m2_knn = joblib.load('best_KNN_RFI.pkl')
m2_knn
knn_pred = m2_knn.predict(np.array(D_test))
plot_confusion_matrix(t_test, knn_pred);
Checking for overfitting by applying the model to the train data
knn_pred_train = m2_knn.predict(D_train_np)
plot_confusion_matrix(t_train, knn_pred_train);
Performance drops about 4% on the macro-average F1 score on the test data. A slight drop compared with training performance is expected.
plot_roc_curve(m2_knn, np.array(D_test),"KNN with p=1 and k=3 and 20 features")
The ROC curve indicates how well the model can discriminate between classes. Since we have significant class imbalance, the curve is not expected to be very high.
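As a complement to eyeballing the curve, sklearn's `roc_auc_score` condenses the TPR/FPR trade-off into a single number (shown here on toy scores, not the fitted KNN):

```python
from sklearn.metrics import roc_auc_score

# toy true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

# AUC = probability a randomly drawn positive is ranked above a
# randomly drawn negative; 0.5 is chance, 1.0 is perfect ranking
auc = roc_auc_score(y_true, y_prob)
print(round(auc, 3))
```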
from sklearn.model_selection import cross_val_score
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=111)
cv_results_KNN = cross_val_score(estimator=m1_knn,
X=D_test,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
cv_results_KNN.round(3)
from sklearn.model_selection import cross_val_score
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=111)
cv_results_KNN_rfs = cross_val_score(estimator=m2_knn,
X=np.array(D_test),
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
cv_results_KNN_rfs.round(3)
results = {"KNN_model1": cv_results_KNN, "KNN_model2": cv_results_KNN_rfs}
df = pd.DataFrame(results)
df.head()
df.mean()
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id', 'level_1': 'model',0:'F1 score'})
# checking which model has higher median performance
import seaborn as sns
sns.boxplot(x=stacked_data["F1 score"],y=stacked_data["model"]).set_title("Cross validation of Model results");
from statsmodels.stats.multicomp import (MultiComparison)
import scipy.stats as stats
MultiComp = MultiComparison(stacked_data['F1 score'],
stacked_data['model'])
# Set up the data for comparison (creates a specialised object)
comp = MultiComp.allpairtest(stats.ttest_rel)
comp[0]
There is a significant difference when we compare the cross-validated results of the two KNN models. Model 2, with features selected by Random Forest importance, beats the model with features selected by F-score importance.
The average F1 score, the harmonic mean of precision and recall, hovers around 44%. This is due to the class imbalance, which causes a low recall score. In the context of the project this is acceptable, since the cost of a wrong prediction is not high: it just means the participant will need to go on more dates to meet more people.
So, to use the KNN model to predict matches, we need to collect the following 20 features from future participants.
The best features are listed below:
RF_features_selection_plot(D_train, t_train, n_features = 20)
We will test the modelling capability of a probability-based machine learning model with Gaussian Naive Bayes (NB).
Gaussian Naive Bayes assumes all features follow a Gaussian distribution, hence we will use power-transformed data to train the model.
Below we will define a pipeline that takes the following into account:
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer
#define pipeline with feature selection
pipe_NB = Pipeline([('fselector', SelectKBest()),
('NB', GaussianNB())])
#define pipeline for selector
params_pipe_NB = {'fselector__score_func': [f_classif],#, mutual_info_classif],
'fselector__k': [10, 20, D_train.shape[1]],
'NB__var_smoothing': np.logspace(0,-9, num=100)}
gs_pipe_NB = GridSearchCV(estimator=pipe_NB,
param_grid=params_pipe_NB,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
Data_transformed = PowerTransformer().fit_transform(D_train)
gs_pipe_NB.fit(Data_transformed, t_train);
search_results_NB = get_search_results(gs_pipe_NB)
rem=search_results_NB.columns[1:4]
search_results_NB.drop(columns=rem).head()
PerformanceComparison = search_results_NB[search_results_NB['fselector__k'] == 10]
alt.Chart(PerformanceComparison.drop(columns="fselector__score_func"),
title='NB Performance Comparison with 10 Features',
width = 400
).mark_line(point = True).encode(
alt.X('NB__var_smoothing:Q', title='Variance Smoothing', scale=alt.Scale(type='log', base=2),axis=alt.Axis(tickRound=True)),
alt.Y('mean_score', title='F1 Score', scale=alt.Scale(zero=False))
)
As seen in the table and plot above, the best model had a variance smoothing of 0.008 with 10 features selected; there is a slight elbow at this point, with performance declining afterwards.
### Save the model for later use
import joblib
joblib.dump(gs_pipe_NB.best_estimator_, 'best_NB1.pkl', compress = 1)
m1_NB = joblib.load('best_NB1.pkl')
m1_NB
# fit the power transformer on the train data (avoiding leakage), then transform the test data
Test_transformed = PowerTransformer().fit(D_train).transform(D_test)
NB_pred = m1_NB.predict(Test_transformed)
plot_confusion_matrix(t_test, NB_pred);
NB_pred_train = m1_NB.predict(Data_transformed)
plot_confusion_matrix(t_train, NB_pred_train);
#define pipeline with feature selection
pipe_NB2 = Pipeline([('rfi_fs', RFIFeatureSelector()),
('NB', GaussianNB())])
# define the parameter grid for the grid search
params_pipe_NB2 = {'rfi_fs__n_features_': [10, 20, D_train.shape[1]],
'NB__var_smoothing': np.logspace(0,-9, num=100)}
gs_pipe_NB2 = GridSearchCV(estimator=pipe_NB2,
param_grid=params_pipe_NB2,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_NB2.fit(Data_transformed, t_train);
search_results_NB2 = get_search_results(gs_pipe_NB2)
rem=search_results_NB2.columns[1:4]
search_results_NB2.drop(columns=rem).head()
PerformanceComparison = search_results_NB2[search_results_NB2['rfi_fs__n_features_'] == 20]
alt.Chart(PerformanceComparison,
title='NB Performance Comparison with 20 Features',
width = 300
).mark_line(point = True).encode(
alt.X('NB__var_smoothing:Q', title='Variance Smoothing', scale=alt.Scale(type='log', base=2)),
alt.Y('mean_score', title='F1 Score', scale=alt.Scale(zero=False))
)
### Save the model for later use
joblib.dump(gs_pipe_NB2.best_estimator_, 'best_NB2.pkl', compress = 1)
m2_NB = joblib.load('best_NB2.pkl')
# fit the transformer on the training data, then apply it to the test data
Test_transformed = PowerTransformer().fit(D_train).transform(D_test)
NB_pred = m2_NB.predict(Test_transformed)
plot_confusion_matrix(t_test, NB_pred);
NB_pred_train = m2_NB.predict(Data_transformed)
plot_confusion_matrix(t_train, NB_pred_train);
No obvious overfitting, since performance on the training data is not drastically better than on the test data.
We plot the ROC curve to evaluate how the model's true positive rate (TPR) trades off against its false positive rate (FPR).
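The TPR/FPR pairs behind an ROC curve can also be computed directly with `sklearn.metrics.roc_curve`; the labels and scores below are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])  # e.g. predict_proba[:, 1]

# each threshold trades TPR against FPR; the AUC summarises the whole curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(np.column_stack([fpr, tpr]))
print(round(auc, 3))  # 0.778: 7 of the 9 positive/negative pairs are ranked correctly
```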
plot_roc_curve(m2_NB, Test_transformed, "NB with variance smoothing = 0.00035 and 20 features selected using RFI")
cv_results_NB1 = cross_val_score(estimator=m1_NB,
X=Test_transformed,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
cv_results_NB1.round(3)
cv_results_NB2 = cross_val_score(estimator=m2_NB,
X=Test_transformed,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
cv_results_NB2.round(3)
results_map = {"NB_model1": cv_results_NB1, "NB_model2": cv_results_NB2}
df = pd.DataFrame(results_map)
df.head()
df.mean()
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id', 'level_1': 'model',0:'F1 score'})
# checking which model has higher median performance
import seaborn as sns
sns.boxplot(x=stacked_data["F1 score"],y=stacked_data["model"]).set_title("Cross validation of Model results");
from statsmodels.stats.multicomp import (MultiComparison)
import scipy.stats as stats
MultiComp = MultiComparison(stacked_data['F1 score'],
stacked_data['model'])
# Set up the data for comparison (creates a specialised object)
comp = MultiComp.allpairtest(stats.ttest_rel)
comp[0]
Both NB models achieve an F1 score of around 0.50. This is much better than using KNN. Model 1, using the f-score selector, performed best with 10 features, while model 2, using the random forest selector, selected 20.
On cross-validation, model 2 performed slightly better than model 1, and the difference is statistically significant. Therefore we can say that with NB, using the 20 features ranked highest by Random Forest Importance gives the better prediction.
These features are given below:
RF_features_selection_plot(D_train, t_train, n_features = 20)
We will use a decision tree (DT) and a random forest (RF) in this section to model the data. We will use grid search cross-validation to tune the parameters within our range limits.
We expect the decision-tree-based models to perform better on features selected with the RFI selector, since that selector is itself information based.
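`RFIFeatureSelector` is the custom selector defined earlier in the notebook; its code is not repeated here, but a minimal stand-in with the same `n_features_` parameter could look like the sketch below (a hypothetical implementation, so details may differ from ours):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

class RFIFeatureSelectorSketch(BaseEstimator, TransformerMixin):
    """Keep the n features with the highest random forest importance."""

    def __init__(self, n_features_=10, random_state=88):
        self.n_features_ = n_features_
        self.random_state = random_state

    def fit(self, X, y):
        rf = RandomForestClassifier(n_estimators=100,
                                    random_state=self.random_state).fit(X, y)
        # indices of the top-n most important features, highest first
        self.support_ = np.argsort(rf.feature_importances_)[::-1][:self.n_features_]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.support_]

# demo on synthetic data: 12 features in, 5 kept
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)
X_sel = RFIFeatureSelectorSketch(n_features_=5).fit(X_demo, y_demo).transform(X_demo)
print(X_sel.shape)
```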
Below we define a pipeline that takes the following into account:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=2,
random_state=88)
# define pipeline with feature selection
pipe_DT = Pipeline([('fselector', SelectKBest()),
('dt', DecisionTreeClassifier(random_state=88))])
params_pipe_DT = {'fselector__score_func': [f_classif],#, mutual_info_classif],
'fselector__k': [10, 20, D_train.shape[1]],
'dt__max_depth': range(2,20),
'dt__criterion': ['gini','entropy'],
'dt__min_samples_split': range(2,12)
}
gs_pipe_DT = GridSearchCV(estimator=pipe_DT,
param_grid=params_pipe_DT,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_DT.fit(D_train, t_train);
The top 5 models are given in the table below. We see there is not much difference in mean score across the configurations. The optimal depth is 7 and the best criterion is gini.
results_DT = get_search_results(gs_pipe_DT)
results_DT.head()
The best estimator and its parameters are given below.
gs_pipe_DT.best_estimator_
The decision tree comparison is plotted in the graph below. The model performs well at a maximum depth of 3, then drops sharply before rising to a second peak at a greater depth. As min_samples_split increases, the performance of the tree decreases slightly, although not by much.
We could potentially use a higher min_samples_split to reduce overfitting.
import altair as alt
alt.Chart(results_DT.drop(columns="fselector__score_func"),
title='Decision Tree Performance Comparison'
).mark_line(point=True).encode(
alt.X('dt__max_depth', title='Maximum Depth'),
alt.Y('mean_score', title='Mean CV Score', scale=alt.Scale(zero=False), aggregate='average'),
color= 'dt__min_samples_split:N'
).facet(column='dt__criterion')
Performance for the different min_samples_split values was nearly the same under both criteria.
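The overfitting effect of a small `min_samples_split` can be sketched on synthetic data (a hypothetical setup, not the speed dating data): with the default of 2 the tree memorises the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=88)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=88)

scores = {}
for mss in (2, 40):
    dt = DecisionTreeClassifier(min_samples_split=mss, random_state=88).fit(Xtr, ytr)
    # (train accuracy, test accuracy): a large gap indicates overfitting
    scores[mss] = (dt.score(Xtr, ytr), dt.score(Xte, yte))
    print(mss, scores[mss])
```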
Dump the best estimator to a file and load it back as m1_DT.
import joblib
joblib.dump(gs_pipe_DT.best_estimator_, 'best_DT.pkl', compress = 1)
m1_DT = joblib.load('best_DT.pkl')
m1_DT
pred= m1_DT.predict(D_test)
plot_confusion_matrix(t_test, pred);
pred_Train = m1_DT.predict(D_train)
plot_confusion_matrix(t_train, pred_Train);
We see an 11% difference between the training and test F1 scores for 'match'. This indicates overfitting on the training data. We will need to look into ways to control overfitting if this is chosen as the best style of model for the speed dating data.
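One option for controlling that overfitting (not tuned in this project) is cost-complexity pruning via `ccp_alpha`; a sketch on synthetic data showing how pruning shrinks the tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=88)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=88)

unpruned = DecisionTreeClassifier(random_state=88).fit(Xtr, ytr)
# ccp_alpha > 0 prunes subtrees whose complexity outweighs their impurity gain
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=88).fit(Xtr, ytr)

print(unpruned.get_n_leaves(), pruned.get_n_leaves())
```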
plot_roc_curve(m1_DT, D_test,"Decision Tree")
# cross-validate the model on the test set for later comparison
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=111)
best_DT = cross_val_score(estimator=m1_DT,
X=D_test,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
best_DT.round(3)
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=2,
random_state=88)
# define pipeline with feature selection
pipe_DT = Pipeline([('rfi_fs', RFIFeatureSelector()),
('dt', DecisionTreeClassifier(random_state=88))])
params_pipe_DT = {'rfi_fs__n_features_': [10],#, 20, D_train.shape[1]],
'dt__max_depth': range(3,12),
'dt__criterion': ['gini','entropy'],
'dt__min_samples_split': range(2,5)
}
gs_pipe_DT_rfi = GridSearchCV(estimator=pipe_DT,
param_grid=params_pipe_DT,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_DT_rfi.fit(D_train_np, t_train_np);
gs_pipe_DT_rfi.best_estimator_
results_DT = get_search_results(gs_pipe_DT_rfi)
results_DT.head()
import altair as alt
alt.Chart(results_DT,
title='Decision Tree Performance Comparison'
).mark_line(point=True).encode(
alt.X('dt__max_depth', title='Maximum Depth'),
alt.Y('mean_score', title='Mean CV Score', scale=alt.Scale(zero=False), aggregate='average'),
color= 'dt__min_samples_split:N'
).facet(column='dt__criterion')
Dump the best estimator to a file and load it back as m2_DT.
import joblib
joblib.dump(gs_pipe_DT_rfi.best_estimator_, 'best_DT_rfi.pkl', compress = 1)
m2_DT = joblib.load('best_DT_rfi.pkl')
m2_DT
D_test_np = np.array(D_test)
pred= m2_DT.predict(D_test_np)
plot_confusion_matrix(t_test, pred);
pred_Train = m2_DT.predict(D_train_np)
plot_confusion_matrix(t_train, pred_Train);
The model selected using the RFI selector controls overfitting better: the difference between training and test F1 score for 'match' is 7%, a clear improvement over the f-score-selected model. Hence the RFI selector appears to make the model less prone to overfitting.
plot_roc_curve(m2_DT, D_test_np,"Decision Tree rfi")
# cross-validate the model on the test set for later comparison
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=111)
best_DT2 = cross_val_score(estimator=m2_DT,
X=D_test_np,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
best_DT2.round(3)
The data comes with preprocessed, binned categorical columns. Research should go into each column to determine how the categories have been binned, and additional research into the psychology of dating, to create meaningful bins, would help improve the model. Here we assume the binning was done by industry experts who put the bins in a meaningful order.
Since the categories are binned, there is a natural order to the information, so it is appropriate to use a label encoder.
Additionally, because we want to select the most important features to understand the key criteria for a match, it is important to keep each feature intact rather than splitting it across dummy columns.
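As a sketch of this encoding step, sklearn's `OrdinalEncoder` encodes all columns at once and lets the bin order be stated explicitly; the column names and values below are hypothetical stand-ins for the speed dating columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# hypothetical binned columns standing in for the speed dating data
df = pd.DataFrame({'age_band': ['[18-22]', '[23-27]', '[18-22]', '[28-32]'],
                   'income':   ['low', 'mid', 'high', 'mid']})

# stating the category order keeps the bins' natural ordering explicit,
# which a per-column LabelEncoder (alphabetical order) does not guarantee
enc = OrdinalEncoder(categories=[['[18-22]', '[23-27]', '[28-32]'],
                                 ['low', 'mid', 'high']])
codes = enc.fit_transform(df)
scaled = MinMaxScaler().fit_transform(codes)
print(scaled)
```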
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
data_cat = df_d.copy()
for col in df_d:
data_cat[col] = le.fit_transform(df_d[col])
data_cat = preprocessing.MinMaxScaler().fit_transform(data_cat)
# train/test split on the categorical data
from sklearn.model_selection import train_test_split
D_train2, D_test2, t_train2, t_test2 = \
train_test_split(data_cat, target, test_size = 0.3,
stratify=target, shuffle=True, random_state=88)
cv_method = RepeatedStratifiedKFold(n_splits=5,
n_repeats=2,
random_state=88)
# define pipeline with feature selection
pipe_DT = Pipeline([('fselector', SelectKBest()),
('dt', DecisionTreeClassifier(random_state=88))])
params_pipe_DT = {'fselector__score_func': [f_classif],#, mutual_info_classif],
'fselector__k': [10, 20, D_train2.shape[1]],
'dt__max_depth': range(2,10),
'dt__criterion': ['gini','entropy'],
'dt__min_samples_split': range(2,12),
}
gs_pipe_DT_cat = GridSearchCV(estimator=pipe_DT,
param_grid=params_pipe_DT,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_DT_cat.fit(D_train2, t_train2);
results_DT = get_search_results(gs_pipe_DT_cat)
results_DT.head()
import altair as alt
alt.Chart(results_DT.drop(columns="fselector__score_func"),
title='Decision Tree Performance Comparison'
).mark_line(point=True).encode(
alt.X('dt__max_depth', title='Maximum Depth'),
alt.Y('mean_score', title='Mean CV Score', scale=alt.Scale(zero=False), aggregate='average'),
color= 'dt__criterion:N'
)
import joblib
joblib.dump(gs_pipe_DT_cat.best_estimator_, 'best_DT_cat.pkl', compress = 1)
m3_DT = joblib.load('best_DT_cat.pkl')
m3_DT
pred= m3_DT.predict(D_test2)
plot_confusion_matrix(t_test2, pred);
pred_Train = m3_DT.predict(D_train2)
plot_confusion_matrix(t_train2, pred_Train);
Between the train and test results, binning seems to have reduced some of the overfitting.
from sklearn.model_selection import cross_val_score
cv_method_ttest = RepeatedStratifiedKFold(n_splits=5,
n_repeats=5,
random_state=111)
best_DT3 = cross_val_score(estimator=m3_DT,
X=D_test2,
y=t_test2,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
best_DT3.round(3)
results_map = {"DT_num_f_score": best_DT,
               "DT_num_rfi": best_DT2,
               "DT_cat": best_DT3}
df = pd.DataFrame(results_map)
#box plot
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id', 'level_1': 'model',0:'F1 score'})
# checking which model has higher median performance
import seaborn as sns
sns.boxplot(x=stacked_data["F1 score"],y=stacked_data["model"]).set_title("Cross validation of Model results");
from statsmodels.stats.multicomp import (MultiComparison)
import scipy.stats as stats
MultiComp = MultiComparison(stacked_data['F1 score'],
stacked_data['model'])
# Set up the data for comparison (creates a specialised object)
comp = MultiComp.allpairtest(stats.ttest_rel, method='Holm')
print (comp[0])
Using the categorical data actually made the model's predictions worse. By binning, we reduced the information available in the data, hence the poorer predictions.
So although binning the information can make the data easier to collect and the model faster to implement, it comes at a cost in accuracy. This trade-off needs to be considered.
We will rule out using categorical features in our modelling process since they perform significantly worse than the numerical features.
There is no significant evidence that the decision tree with features selected by the f-score method differs from the model selected via the RFI method. Both use 10 features, listed below:
RF_features_selection_plot(D_train, t_train, n_features = 10)
A random forest is an ensemble of decision trees, so we would expect it to outperform the single decision tree models. There are a number of criteria by which to select the models; here we pick just a few to maximise performance.
In line with the other models, we will use the F1 score as the score to maximise, because of the imbalance in our targets.
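A toy example of why accuracy would mislead here: with a 90:10 class split, always predicting the majority class looks accurate while F1 exposes the failure (the labels below are synthetic):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 90 + [1] * 10)   # 90:10 imbalance
y_pred = np.zeros(100, dtype=int)        # always predict the majority class

print(accuracy_score(y_true, y_pred))  # 0.9 despite learning nothing
print(f1_score(y_true, y_pred))        # 0.0: no true positives at all
```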
pipe_RF = Pipeline([('fselector', SelectKBest()),
('rf', RandomForestClassifier(random_state=88))])
params_pipe_RF = {'fselector__score_func': [f_classif],#, mutual_info_classif],
'fselector__k': [10, 20, D_train.shape[1]],
'rf__n_estimators':[100,200,500,1000],
'rf__criterion': ['gini','entropy']}
gs_pipe_RF = GridSearchCV(estimator=pipe_RF,
param_grid=params_pipe_RF,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_RF.fit(D_train, t_train);
Below are the top 5 results for the random forest pipeline using the f-score feature selector.
results_RF = get_search_results(gs_pipe_RF)
results_RF.head()
Visualising the results in the chart below, we see a peak at 500 estimators, and the entropy criterion outperforms gini. To tune the model further there are other cross-validation methods we could employ, but due to time constraints we only test the overall cross-validation on a few configurations.
alt.Chart(results_RF.drop(columns="fselector__score_func"),
title='Random Forest Performance Comparison'
).mark_line(point=True).encode(
alt.X('rf__n_estimators', title='Maximum number of estimators'),
alt.Y('mean_score', title='Mean CV Score', aggregate='average', scale=alt.Scale(zero=False)),
color= 'rf__criterion:N'
)
joblib.dump(gs_pipe_RF.best_estimator_, 'best_RF.pkl', compress = 1)
m1_RF = joblib.load('best_RF.pkl')
m1_RF
Below we define a pipeline that takes the following into account:
pipe_RF = Pipeline([('rfi_fs', RFIFeatureSelector()),
('rf', RandomForestClassifier(random_state=88))])
params_pipe_RF2 = {'rfi_fs__n_features_': [10, 20, D_train.shape[1]],
'rf__n_estimators': [50,100,300,500]}
gs_pipe_RF2 = GridSearchCV(estimator=pipe_RF,
param_grid=params_pipe_RF2,
cv=cv_method,
scoring='f1',
verbose=1,
n_jobs=-2)
gs_pipe_RF2.fit(D_train_np, t_train_np);
results_RF2 = get_search_results(gs_pipe_RF2)
results_RF2.head()
alt.Chart(results_RF2,
title='Random Forest Performance Comparison'
).mark_line(point=True).encode(
alt.X('rf__n_estimators', title='Maximum number of estimators'),
alt.Y('mean_score', title='Mean CV Score', aggregate='average', scale=alt.Scale(zero=False)),
color= 'rfi_fs__n_features_:N'
)
From the plot we can see a peak at 300 estimators. Also, the models achieve better accuracy using 10 features overall compared with the other configurations.
The best model is selected below.
joblib.dump(gs_pipe_RF2.best_estimator_, 'best_RF2.pkl', compress = 1)
m2_RF = joblib.load('best_RF2.pkl')
m2_RF
pred= m2_RF.predict(D_test_np)
plot_confusion_matrix(t_test, pred);
cv_method_ttest
# cross-validate the model on the test set for later comparison
best_RF1 = cross_val_score(estimator=m1_RF,
X=D_test,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
best_RF1.round(3)
# cross-validate the model on the test set for later comparison
best_RF2 = cross_val_score(estimator=m2_RF,
X=D_test_np,
y=t_test,
cv=cv_method_ttest,
n_jobs=-2,
scoring='f1')
best_RF2.round(3)
We will compare the cross-validated results of:
with the results we gained from the two information-based models:
We will then compare the results from 5x5-fold cross-validation to determine which type of model performs best.
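The pairwise comparison below rests on the paired t-test; in isolation it looks like this (the fold-wise scores here are made up for illustration):

```python
import numpy as np
from scipy.stats import ttest_rel

# hypothetical fold-wise F1 scores for two models on the same CV splits
scores_a = np.array([0.48, 0.52, 0.50, 0.47, 0.51])
scores_b = np.array([0.55, 0.58, 0.54, 0.53, 0.57])

# paired test: each pair of scores comes from the same data split
stat, pval = ttest_rel(scores_a, scores_b)
print(round(stat, 2), round(pval, 4))
```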
results_map = {"KNN": cv_results_KNN_rfs, "NB": cv_results_NB2, "DT": best_DT, "DT2": best_DT2, "RF1": best_RF1, "RF2": best_RF2}
df = pd.DataFrame(results_map)
df.head()
stacked_data = df.stack().reset_index()
stacked_data = stacked_data.rename(columns={'level_0': 'id', 'level_1': 'model',0:'F1 score'})
# checking which model has higher median performance
import seaborn as sns
sns.boxplot(x=stacked_data["F1 score"],y=stacked_data["model"]).set_title("Cross validation of Model results");
from statsmodels.stats.multicomp import (MultiComparison)
import scipy.stats as stats
MultiComp = MultiComparison(stacked_data['F1 score'],
stacked_data['model'])
#converting this multiple comparison paired t test result into a nice data frame :)
comp = MultiComp.allpairtest(stats.ttest_rel, method='Holm')
compa=pd.DataFrame(comp[0])
compa=compa.iloc[1:,0:4]
compa = compa.rename(columns={0: 'c', 1: 'c1',2:'stat',3:'pval'})
compa['pval']=compa['pval'].astype(str).astype(float)
compa['stat']=compa['stat'].astype(str).astype(float)
compa.loc[compa["pval"]<=0.05,"result"]= "reject"
compa.loc[compa["pval"]>0.05,"result"]= "fail to reject"
compa.loc[compa["result"]=="reject",:]
compa["pair"] = compa["c"].astype(str) + " - " + compa["c1"].astype(str)
compa = compa[['pair', 'pval', 'result']]  # drops the raw c/c1 columns
all_comp_ttest_models = compa.copy()
All pairs that are significantly different from each other are shown below:
all_comp_ttest_models
In this project we wanted to find the best type of model for predicting speed dating matches, together with a small set of the most important features.
In the final comparison of the best models from each category:
We find that the probability-based model, Gaussian naive Bayes (NB), performs best. The paired t-test and box plot show its cross-validation results are significantly better than the other models'. Within the probability model, selecting 20 features with the random forest selector yielded the best results in this preliminary study.
While we had high hopes for the information-based models, they underperformed. With categorical data, the model predicted worse but overfit less. From this we can infer that although binning the data into categories can simplify data gathering, it comes at a cost in accuracy.
Finally, we conclude that speed dating match results are best modelled with probability-based models using the top 20 features.
RF_features_selection_plot(D_train, t_train, n_features = 20)
We can see from the feature ranking that liking the other party comes first, followed by being funny, mutual attractiveness and shared interests.
It is interesting that intelligence, sincerity, age difference and same race matter much less to a match result. Although these attributes may be essential for long-term relationships, it shows that in the end humans are still emotional creatures. Being funny and being attractive gain attention and good rapport fast, much faster than the other attributes.
This is probably why so many people write in their dating app profiles that they want someone who makes them laugh.
Since we have identified that probability-based models outperform the other model types on this speed dating data, we can improve prediction accuracy by:
Feature engineering - combining or augmenting some of the top selected features, under the supervision of a human behaviour specialist, to create better features from the responses.
Applying more advanced probability-based models.
Exploring ways to deal with class imbalance in the data. In this study we only managed class imbalance by optimising the F1 score, since we were only trying to determine the best type of model and the resulting features. More advanced techniques for handling class imbalance can be applied when training the model in the next step.
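One of the simpler such techniques, not used in this project, is class weighting; a sketch on a synthetic imbalanced problem (whether it actually helps will depend on the real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic 9:1 problem standing in for the match/no-match target
X, y = make_classification(n_samples=1000, weights=[0.9], flip_y=0.1,
                           random_state=88)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=88)

plain = RandomForestClassifier(random_state=88).fit(Xtr, ytr)
# 'balanced' reweights samples inversely to their class frequency
weighted = RandomForestClassifier(class_weight='balanced',
                                  random_state=88).fit(Xtr, ytr)

f_plain = f1_score(yte, plain.predict(Xte))
f_weighted = f1_score(yte, weighted.predict(Xte))
print(round(f_plain, 3), round(f_weighted, 3))
```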
The above three points are just some of the steps to further optimise the speed dating modelling results. This project is just a beginning; there is more future research we could do.
Speed Dating dataset from OpenML
Code taken from RMIT MATH2319 Tutorial on SK5
Multiple comparison t-test from: https://pythonhealthcare.org/2018/04/13/55-statistics-multi-comparison-with-tukeys-test-and-the-holm-bonferroni-method/